Population variability in the generation and thymic selection of T-cell repertoires
The diversity of T-cell receptor (TCR) repertoires is achieved by a
combination of two intrinsically stochastic steps: random receptor generation
by VDJ recombination, and selection based on the recognition of random
self-peptides presented on the major histocompatibility complex. These
processes lead to a large receptor variability within and between individuals.
However, the characterization of the variability is hampered by the limited
size of the sampled repertoires. We introduce a new software tool SONIA to
facilitate inference of individual-specific computational models for the
generation and selection of the TCR beta chain (TRB) from sequenced repertoires
of 651 individuals, separating and quantifying the variability of the two
processes of generation and selection in the population. We find not only that
most of the variability is driven by the VDJ generation process, but also that
there is a large degree of consistency between individuals, with the
inter-individual variance of repertoires being about 2% of the intra-individual
variance. Known viral-specific TCRs follow the same generation and selection
statistics as all TCRs.
Comment: 13 pages, 7 figures, 2 tables
On generative models of T-cell receptor sequences
T-cell receptors (TCR) are key proteins of the adaptive immune system,
generated randomly in each individual, whose diversity underlies our ability to
recognize infections and malignancies. Modeling the distribution of TCR
sequences is of key importance for immunology and medical applications. Here,
we compare two inference methods trained on high-throughput sequencing data: a
knowledge-guided approach, which accounts for the details of sequence
generation, supplemented by a physics-inspired model of selection; and a
knowledge-free Variational Auto-Encoder based on deep artificial neural
networks. We show that the knowledge-guided model outperforms the deep network
approach at predicting TCR probabilities, while being more interpretable, at a
lower computational cost.
OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs
Motivation: High-throughput sequencing of large immune repertoires has
enabled the development of methods to predict the probability of generation by
V(D)J recombination of T- and B-cell receptors of any specific nucleotide
sequence. These generation probabilities are very non-homogeneous, ranging over
20 orders of magnitude in real repertoires. Since the function of a receptor
really depends on its protein sequence, it is important to be able to predict
this probability of generation at the amino acid level. However, brute-force
summation over all the nucleotide sequences with the correct amino acid
translation is computationally intractable. The purpose of this paper is to
present a solution to this problem.
Results: We use dynamic programming to construct an efficient and flexible
algorithm, called OLGA (Optimized Likelihood estimate of immunoGlobulin
Amino-acid sequences), for calculating the probability of generating a given
CDR3 amino acid sequence or motif, with or without V/J restriction, as a result
of V(D)J recombination in B or T cells. We apply it to databases of
epitope-specific T-cell receptors to evaluate the probability that a typical
human subject will possess T cells responsive to specific disease-associated
epitopes. The model prediction shows an excellent agreement with published
data. We suggest that OLGA may be a useful tool to guide vaccine design.
Availability: Source code is available at https://github.com/zsethna/OLG
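The contrast drawn above between brute-force summation and dynamic programming can be illustrated with a toy sketch. This is not OLGA's algorithm or its V(D)J model. It assumes a hypothetical first-order Markov chain over nucleotides (the names `P0`, `T`, `aa_prob_bruteforce` and `aa_prob_dp` are invented for illustration) and only shows how a forward recursion over codons can replace enumeration of every synonymous nucleotide sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical toy model (an assumption, not a V(D)J recombination model):
# uniform first base, then a Markov chain that favours repeating the base.
P0 = {b: 0.25 for b in BASES}
T = {prev: {nxt: (0.4 if nxt == prev else 0.2) for nxt in BASES} for prev in BASES}


def nt_prob(seq):
    """Probability of one non-empty nucleotide sequence under the toy chain."""
    p = P0[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= T[a][b]
    return p


def aa_prob_bruteforce(aa_seq):
    """Sum nt_prob over all 4**(3L) nucleotide sequences; exponential in L."""
    total = 0.0
    for nts in product(BASES, repeat=3 * len(aa_seq)):
        seq = "".join(nts)
        codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
        if all(CODON_TABLE[c] == aa for c, aa in zip(codons, aa_seq)):
            total += nt_prob(seq)
    return total


def aa_prob_dp(aa_seq):
    """Same sum by a forward recursion over codons; linear in L."""
    if not aa_seq:
        return 1.0
    # forward[b]: total probability of all nucleotide prefixes that encode
    # the amino acids processed so far and end in base b.
    forward = None
    for aa in aa_seq:
        new = {b: 0.0 for b in BASES}
        for codon, coded in CODON_TABLE.items():
            if coded != aa:
                continue
            c1, c2, c3 = codon
            if forward is None:
                start = P0[c1]
            else:
                start = sum(forward[b] * T[b][c1] for b in BASES)
            new[c3] += start * T[c1][c2] * T[c2][c3]
        forward = new
    return sum(forward.values())


if __name__ == "__main__":
    # Both routes should agree; the brute-force one visits 4**9 sequences.
    print(aa_prob_dp("CAS"), aa_prob_bruteforce("CAS"))
```

The only state carried between codons is the last nucleotide, because that is all the toy chain needs to remember. A realistic recombination model would require a richer state, which this sketch does not attempt.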
Inferring processes underlying B-cell repertoire diversity
We quantify the VDJ recombination and somatic hypermutation processes in
human B-cells using probabilistic inference methods on high-throughput DNA
sequence repertoires of human B-cell receptor heavy chains. Our analysis
captures the statistical properties of the naive repertoire, first after its
initial generation via VDJ recombination and then after selection for
functionality. We also infer statistical properties of the somatic
hypermutation machinery (exclusive of subsequent effects of selection). Our
main results are the following: the B-cell repertoire is substantially more
diverse than T-cell repertoires, due to longer junctional insertions; sequences
that pass initial selection are distinguished by having a higher probability of
being generated in a VDJ recombination event; somatic hypermutations have a
non-uniform distribution along the V gene that is well explained by an
independent site model for the sequence context around the hypermutation site.
Comment: acknowledgement added
Probability, Entropy, and Adaptive Immune System Repertoires
The adaptive immune system, composed of white blood cells called lymphocytes (B and T cells) that circulate in the lymph and blood, is a precision tool that tags and removes foreign peptides. Such peptides, also called antigens or epitopes, are identified by a specific binding to elements of a library or repertoire of unique proteins called receptors (e.g. antibodies or T cell receptors). A repertoire must be large and diverse enough so that at least one receptor will be able to recognize any pathogen epitope the organism is likely to encounter. This diversity is achieved by stochastic rearrangement of the germline DNA to create novel complementarity determining region sequences (CDR3) in a process called V(D)J recombination.
In this thesis we utilize previously developed generative models of V(D)J recombination events, and infer the model parameters from large datasets of DNA sequences. The generation probability (Pgen) of a nucleotide or amino acid CDR3 is the sum of all model probabilities of V(D)J recombination events that generate the sequence. While previously it was only feasible to compute Pgen of nucleotide sequences, we introduce a novel dynamic programming algorithm that efficiently computes Pgen of amino acid sequences. We use this Pgen for several applications. First we examine how the diversity of a repertoire, characterized by the model entropy, scales with the number of insertions in the V(D)J process. This is used to describe the maturation of the T cell repertoire of mice from embryos to young adults. Next, we introduce a statistical model of hypermutation in B cells and infer the parameters from a human repertoire, providing a principled quantification of the biases in hypermutation rates. Lastly, we examine the statistics of the receptors shared amongst a cohort of more than 600 individual humans and show that the statistics and identities of so-called “public” sequences are determined directly from Pgen.
We highlight possible clinical applications and attempt to place this work in the
context of a full theory of the adaptive immune system.
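The verbal definition of Pgen given in this abstract can be written compactly. The notation below is an assumption introduced for illustration: $E$ denotes a recombination event, $P_{\mathrm{rec}}(E)$ its model probability, and $\sigma$ a CDR3 sequence. The entropy line is one common way to quantify the diversity of a sequence distribution and may differ from the exact quantity used in the thesis.

```latex
% Generation probability: sum over all recombination events E that yield sigma
P_{\mathrm{gen}}(\sigma) \;=\; \sum_{E \,\to\, \sigma} P_{\mathrm{rec}}(E)

% One possible diversity measure (assumed form): Shannon entropy in bits
S \;=\; -\sum_{\sigma} P_{\mathrm{gen}}(\sigma)\,\log_2 P_{\mathrm{gen}}(\sigma)
```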
Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector
Deciphering individual cell phenotypes from cell-specific transcriptional processes requires high dimensional single cell RNA sequencing. However, current dimensionality reduction methods aggregate sparse gene information across cells, without directly measuring the relationships that exist between genes. By performing dimensionality reduction with respect to gene co-expression, low-dimensional features can model these gene-specific relationships and leverage shared signal to overcome sparsity. We describe GeneVector, a scalable framework for dimensionality reduction implemented as a vector space model using mutual information between gene expression. Unlike other methods, including principal component analysis and variational autoencoders, GeneVector uses latent space arithmetic in a lower dimensional gene embedding to identify transcriptional programs and classify cell types. In this work, we show in four single cell RNA-seq datasets that GeneVector was able to capture phenotype-specific pathways, perform batch effect correction, interactively annotate cell types, and identify pathway variation with treatment over time.
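Mutual information between the expression of two genes, the quantity this abstract builds on, can be sketched for discretized data as follows. This is a generic plug-in estimator, not GeneVector's implementation. The function name `mutual_information` and the on/off encoding of expression are assumptions made for illustration.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences.

    x[i] and y[i] are assumed to be the discretized expression levels
    of two genes in cell i (for example 0 = not detected, 1 = detected).
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    joint = Counter(zip(x, y))   # counts of each (x, y) pair
    count_x = Counter(x)         # marginal counts for gene x
    count_y = Counter(y)         # marginal counts for gene y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_x[a] * count_y[b]))
    return mi


if __name__ == "__main__":
    # Hypothetical on/off calls for two genes across eight cells.
    gene_a = [1, 1, 0, 0, 1, 0, 1, 0]
    gene_b = [1, 1, 0, 0, 1, 0, 0, 1]
    print(mutual_information(gene_a, gene_b))
```

With sparse single-cell counts a plug-in estimate like this is biased upward for small samples, so it should be read as a conceptual illustration only.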
Fundamental immune–oncogenicity trade-offs define driver mutation fitness
Missense driver mutations in cancer are concentrated in a few hotspots(1). Various mechanisms have been proposed to explain this skew, including biased mutational processes(2), phenotypic differences(3–6) and immunoediting of neoantigens(7,8); however, to our knowledge, no existing model weighs the relative contribution of these features to tumour evolution. We propose a unified theoretical “free fitness” framework that parsimoniously integrates multimodal genomic, epigenetic, transcriptomic and proteomic data into a biophysical model of the rate-limiting processes underlying the fitness advantage conferred on cancer cells by driver gene mutations. Focusing on TP53, the most mutated gene in cancer(1), we present an inference of mutant p53 concentration and demonstrate that TP53 hotspot mutations optimally solve an evolutionary trade-off between oncogenic potential and neoantigen immunogenicity. Our model anticipates patient survival in The Cancer Genome Atlas and patients with lung cancer treated with immunotherapy as well as the age of tumour onset in germline carriers of TP53 variants. The predicted differential immunogenicity between hotspot mutations was validated experimentally in patients with cancer and in a unique large dataset of healthy individuals. Our data indicate that immune selective pressure on TP53 mutations has a smaller role in non-cancerous lesions than in tumours, suggesting that targeted immunotherapy may offer an early prophylactic opportunity for the former. Determining the relative contribution of immunogenicity and oncogenic function to the selective advantage of hotspot mutations thus has important implications for both precision immunotherapies and our understanding of tumour evolution.